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ABSTRACT 

We present a method for performing Principal Component Analysis (PCA) on noisy datasets 
with missing values. Estimates of the measurement error are used to weight the input data such 
that compared to classic PCA, the resulting eigenvectors are more sensitive to the true underlying 
signal variations rather than being pulled by heteroskedastic measurement noise. Missing data is 
simply the limiting case of weight=0. The underlying algorithm is a noise weighted Expectation 
Maximization (EM) PCA, which has additional benefits of implementation speed and flexibility 
for smoothing eigenvectors to reduce the noise contribution. We present applications of this 
method on simulated data and QSO spectra from the Sloan Digital Sky Survey. 

Subject headings: Data Analysis and Techniques 



1. Introduction 

Principal Component Analysis (PCA) is a pow- 
erful and widely used technique to analyze data 
by forming a custom set of "principal component" 
eigenvectors that are optimized to describe the 
most data variance with the fewest number of 
comp onents ( Pearson! 1901 ; Hotellinglll933 ; Jolliffe 



2002). With the full set of eigenvectors the data 
may be reproduced exactly, i.e., PCA is a trans- 
formation which can lend insight by identifying 
which variations in a complex dataset are most 
significant and how they are correlated. Alter- 
nately, since the eigenvectors are optimized and 
sorted by their ability to describe variance in the 
data, PCA may be used to simplify a complex 
dataset into a few eigenvectors plus coefficients, 
under the approximation that higher-order eigen- 
vectors are predominantly describing fine tuned 
noise or otherwise less important features of the 
data. Example applications within astronomy in- 
clude classi fying spectra by fitting them to PCA 



templates ([Paris et al.l 1201 It IConnollv fc Szalav 



1999), describing Hubble S pace Telescope point 
spread function variations (|Jee et al.l 120071) . and 
reducing the dimensionality of Cosmic Micr owave 
Background map data prior to analysis (jBondl 



19951 ). 

A limitation of classic PCA is that it does not 
distinguish between variance due to measurement 
noise vs. variance due to genuine underlying signal 
variations. Even when an estimate of the measure- 
ment variance is available, this information is not 
used when constructing the eigenvectors, e.g., by 
deweighting noisy data. 

A second limitation of classic PCA is the case 
of missing data. In some applications, certain ob- 
servations may be missing some variables and the 
standard formulas for constructing the eigenvec- 
tors do not apply. For example within astron- 
omy, observed spectra do not cover the same rest- 
frame wavelengths of objects at different redshifts, 
and some wavelength bins may be masked due 
to bright sky lines or cosmic ray contamination. 
Missing data is an extreme case of noisy data, 
where missing data is equivalent to data with in- 
finite measurement variance. 

This work describes a PCA framework which 
incorporates estimates of measurement variance 
while solving for the principal components. This 
optimizes the eigenvectors to describe the true 
underlying signal variations without being un- 
duly affected by known measurement noise. Code 
which implements this algorithm is available at 
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https : / / github . com/ sbailey/ empca . 

JoUifTj j2Q02j) §13.6 and §14.2 review prior 



work on PCA with missing data and incorporating 
weights into PCA. Most prior work focuses on the 
identification and removal of outlier data, inter- 
polation over missing data, or special cases such 
as when the weights can be factorized into inde- 
pendent per-obse r vation a nd per- variable weights . 



Gabriel fc Zamirl (Il979h . IWentzell et ail (Il997l) 



Tipping fc Bishorj ( 1993) > and ISrebro fc Jaakkola 



(|2003l ) present iterative solutions for the case of 
general weights, though none of these find the 
true PCA solution with orthogonal eigenvectors 
optimally ranked by their ability to describe the 
data variance. Instead, they find an unsorted set 
of non-orthogonal vectors which as a set are opti- 
mized to describe the data variance, but individu- 
ally they are not the optimal linear combinations 
to describe the most variance with the fewest vec- 
tors. Their methods are sufficient for weighted 
lower-rank matrix approximation, but they lack 
the potential data insight from optimally com- 
bining and sorting the eigenvectors to see which 
components contribute the most variation. 



Wi thin the astronomy literature. IConnollv fc Szalav 
(Il999h discuss how to interpolate over missing 
data and use PCA eigenspectra to fit noisy and/or 
missing data, but they do not address the case of 
how to genera te eigenspectra from noisy but non- 
missing data. Blanton fc Roweisl (|2007l ) generate 
template spectra from noisy and missing data us- 
ing Non-negative Matrix Factorization (NMF). 
This method is similar to PCA with the con- 
straint that the template spectra are strictly pos- 
itive, while not requiring the template s to be or- 
thonormal. iTsalmantza fc Hogg! (|2012l ) present a 
more general "Heteroskedastic Matrix Factoriza- 
tion" approach to study Sloan Digital Sky Survey 
spectra while properly accounting for measure- 
ment noise. Their underlying goal is similar to 
this work, though with an algorithmically differ- 
ent implementation. 

The methods presented here directly solve for 
the PCA eigenvectors with an iterative solution 
based upon Expectat ion Maximization PCA (EM- 
PCA). R OWCl sl (|l997t ) describes an unweighted ver- 
sion of EMPCA, including a method for interpo- 
lating missing data, but he does not address the 
issue of deweighting noisy data. We also take ad- 
vantage of the iterative nature of the solution to 



bring unique extensions to PCA, such as noise- 
filtering the eigenvectors during the solution. 

The approach taken here is fundamentally prag- 
matic. For example, if one is interested in gener- 
ating eigenvectors to describe 99% of the signal 
variance, it likely doesn't matter if an iterative al- 
gorithm has "only" converged at the level of 10 -5 
even if the machine precision is formally 10~ 15 . 
We discuss some of the limitations of Weighted 
EMPCA in gSJ but ultimately we find that these 
issues are not limiting factors for practical appli- 
cations. 

This work was originally developed for PCA 
analysis of astronomical spectra and examples are 
given in that context. It should be noted, how- 
ever, that these methods are generally applicable 
to any PCA application with noisy and/or missing 
data — nothing in the underlying methodology is 
specific to astronomical spectra. 

2. Notation 

This paper uses the following notation: Vec- 
tors use arrows, x, while Xi represents the scalar 
element i of vector x. For sets of vectors, Xj repre- 
sent vector number j (not element j of vector x). 
Matrices are in boldface, X. To denote vectors 
formed by selected columns or rows of a matrix, 
we use X$ for the vector formed from column j 

of matrix X and X™ w for the vector formed from 
row i of matrix X. The scalar element at row i 
column j of matrix X is X^ j . 

For reference, we summarize the names of the 
primary variables here: X is the data matrix with 
n V ar rows of variables and n b s columns of observa- 
tions. P is the PCA eigenvector matrix with n var 
rows of variables and n vcc columns of eigenvec- 
tors; 4> is a single eigenvector. These eigenvectors 
may fit the data using a matrix of coefficients C, 
where Cjy is the contribution of eigenvector k to 
observation j. Indices i, j, and k index variables, 
observations, and eigenvectors respectively. X is 
the vector formed by stacking the columns of ma- 
trix X. The measurement covariance of dataset X 
is V, while W is the weights matrix for dataset 
X for the case of independent measurement noise 
such that W has the same dimensions as X. 
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3. Classic PCA 



4. Adding Weights to PCA 



sec 



For an acc essible tutorial on classic PCA 
Schlensl (l2009h. A m uch m ore complete treatment 
is found in Ijolliffd (l2002h . Al gorithmically, the 
steps are simple: The principal components {4>k} 
of a dataset are simply the eigenvectors of the co- 
variance of that dataset, sorted by their descend- 
ing eigenvalues. A new observation y may be ap- 
proximated as 



(i) 



where jl is the mean of the initial dataset and Ci 
is the reconstruction coefficient for eigenvector tfii. 
For the rest of this paper, we will assume that jl 
has already been subtracted from the data, i.e., 

v <- (y-fl)- 

To find a particular coefficient cy , take the dot 
product of both sides with (j>y , noting that because 
of the eigenvector orthogonality, <pk ■ <pk> — Skk' 
(Kroeneker-delta) : 



k 



^ Ck$kk' 



k 
Cfc' 



(2) 

(3) 
(4) 



Note that the neither the solution of {4>k} nor 
{cfc} makes use of any noise estimates or weights 
for the data. As such, classic PCA solves the min- 
imization problem 



x 2 = 



Et x - pc ^ 



(5) 



where X is a dataset matrix whose columns are 
observations and rows are variables, P is a ma- 
trix whose columns are the principal components 
{4>k} to find, and C is a matrix of coefficients 
to fit X using P. For clarity, the dimensions 
of these matrices are: X[n var , n bs], Pfavar, n vec ], 
and C[n VG c,nob s ], where n obs , n var , and n voc are 
the number of observations, variables, and eigen- 
vectors respectively. For example, when perform- 
ing PCA on spectra, n b s is the number of spectra, 
rt V ar is the number of wavelength bins per spec- 
trum, and n vec is the number of eigenvectors used 
to describe the data, which may be smaller than 
the total number of possible eigenvectors. 



The goal of this work is to solve for the eigen- 
vectors P while incorporating a weights matrix W 
on the dataset X: 



X 



^W«[X-PC] 



(6) 



We also describe the more general cases of per- 
observation covariances V., : 



obs j 



X" 



p C col 



p C col 



(7) 

where we have used the notation that X™ 1 is the 
vector formed from the jth column of the matrix 
X, and similarly for Cj ■ In the most general 
case, there is covariance V between all variables 
of all observations, i.e., we seek to minimize 



X 2 = X-[P]C V- 1 X-[P]C 



(8) 



Where X and C are the vectors formed by con- 
catenating all columns of X and C, and [P] is the 
matrix formed by stacking P n b s times. 

This allows one to incorporate error estimates 
on heteroskedastic data such that particularly 
noisy data does not unduly influence the solu- 
tion. We will solve this problem using an iterative 
method known within the statistics community as 
"Expectation Maximization" . 

5. Weighted Expectation Maximization 
PCA 

5.1. Expectation Maximization PCA 

Expectation Maximization (EM) is an iterative 
technique for solving parameters to maximize a 
likelihood function for models with unknown hid- 



den f or latent) variables (|Dempster, Laird, k, Rubin 
19771 ). Each iteration involves two steps: finding 
the expectation value of the hidden variables given 
the current model (E-step), and then modifying 
the model parameters to maximize the fit likeli- 
hood given the estimates of the hidden variables 
(M-step). 

As applied to PCA, the parameters to solve are 
the eigenvectors, the latent variables are the coeffi- 
cients {c} for fitting the data using those eigenvec- 
tors, and the likelihood is the ability of the eigen- 
vectors to describe the data. To solve the single 
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most significant eigenvector, start with a random 
vector <fi of length n var - For each observation Xj, 



solve the coefficient c. 



which best fits 



that observation using <f>. Then using those coeffi- 
cients, update (j> to find the vector that best fits the 
data given those coefficients: <— J2j c j^j/ Ej c "j ■ 
Then normalize <f> to unit length and iterate the so- 
lutions to {c} and <$> until converged. In summary: 

4> <— random vector of length n va r 



repeat until converged: 

For each observation Xj : 



<-E,-cj%/E 



3 3 



E-step 
M-step 
Renormalize 



return 



This generates a vector <f> which is the dominant 
PCA eigenvector of the dataset X, where the ob- 
servations Xj are the columns of X. The expecta- 
tion step finds the coefficients {cj} which best fit 
X using (j) (see equations [2] to S} . The likelihood 
maximization step then uses those coefficients to 
update 4> to minimize 



X 



E 



(9) 



obs j 



In practice the normalization y\ - c 2 in the M- 

step is unnecessary since <j) is renormalized to unit 
length after every iteration. 

At first glance, it can be surprising that this al- 
gorithm works at all. Its primary enabling feature 
is that at each iteration the coefficients {&,} and 
vector (j> minimize the \ 2 better than the previous 
iteratio n. The y 2 of equation [9 has a single min- 
imum (jSrebro fc Jaakkolall2003l ). thus when any 
minimum is found it is the true global minimum. 
It is possible, however, to also have saddle points 
to which the EMPCA algorithm could converge 
from particular starting points. It is easy to test 
solutions for being at a saddle point and restart 
the iterations as needed. 

The specific convergence criteria are applica- 
tion specific. One pragmatic option is that the 
eigenvector itself is changing slowly, i.e., \A(j>\ < e. 
Alternately, one could require that the change in 
likelihood (or A% 2 ) from one iteration to the next 
is below some threshold. Convergence and unique- 
ness will be discussed in Ei8.II and £18.21 For now 



we simply note that many PCA applications are 
interested in describing 95% or 99% of the data 
variance, and the above algorithm typically con- 
verges very quickly for this level of precision, even 
for cases when the formal computational machine 
convergence may require many iterations. 

To find additional eigenvectors, subtract the 
projection of <f> from X and repeat the algorithm. 
Continue this procedure until enough eigenvectors 
have been solved that the remaining variance is 
consistent with the expected noise of the data, or 
until enough eigenvectors exist to approximate the 
data with the desired fidelity. If only a few eigen- 
vectors are needed for a large dataset, this algo- 
rithm can be much faster than classic PCA which 
requires solving for all eigenvectors whether or not 
they are needed. Scaling performance will be dis- 
cussed further in ^18.41 



5.2. EMPCA with per-observation weights 

The above algorithm treats all data equally 
when solving for the eigenvectors and thus is 
equivalent to classic PCA. If all data are of approx- 
imately equal quality then this is fine, but if some 
data have considerably larger measurement noise 
they can unduly influence the solution. In these 
cases, the high signal-to-noise data should receive 
greater weight than low signal-to-noise data. This 
is conceptually equivalent to the difference be- 
tween a weighted and unweighted mean. 

In some applications, it is sufficient to have 
a single weight per observation so that all vari- 
ables within an observation are equally weighted, 
but different observations are weighted more or 
less than others. In this case, EMPCA can be 
extended with per-observation weights w;. The 
observations X should have their weighted mean 
subtracted, and the likelihood maximization step 
(M-step) is replaced with: 



W 3 C 3 X 3 

3 



(10) 



The normalization denominator has been dropped 
because we re-normalize <f> to unit length every 
iteration. 

5.3. EMPCA with per-variable weights 

If each variable for each observation has a dif- 
ferent weight, the situation becomes more com- 
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X 



P c 



matrix form, X = PC can be solved for each in- 
dependent observation column j of X and C: 



^£col _ p C c .°^ 

Fig. 1. — Illustration of solving C one column at 
a time, i.e., equation H"3l 

plicated since we cannot use simple dot products 
to derive the coefficients {cj}. Instead, one must 
solve a set of linear equations for {cj}. Similarly, 
the likelihood maximization step must solve a set 
of linear equations to update <j> instead of just per- 
forming a simple sum. The weighted EMPCA al- 
gorithm now starts with a set of random orthonor- 
mal vectors {<f>k} and iterates over: 

1. For each observation Xj, solve coefficients 

k 

2. Given {ckj}, solve each 4>k one-by-one for k 
in l..rivec: 

Xj - ^ Ck>j$k' = Ck 3 4>k (12) 

k'<k 

Both of the above steps can be solved using 
weights on Xj , thus achieving the goals of properly 
weighting the data while solving for the coefficients 
{ckj} and eigenvectors (f>k- Implementation details 
will be described in the following two sub-sections, 
where we will return to using matrix notation. 

5.3.1. Notes on solving {ckj} = C 

In equation 111! the <f>k vectors are fixed and 
one solves the coefficients Ckj with a separate set 
of equations for each observation Xj. Written in 



X>* = PC™ 1 + noise (13) 

Equation [T3] is illustrated in figure [T] 

Solving equation[l3]for C™ 1 with noise- weighting 
by measurement covariance Vj is a straight- 
forward linear least-squares problem which may be 
solved with Singular Value Decomposition (SVD), 
QR factorization, conjugate gradients, or other 
methods. For example, using the method of "Nor- 
mal equations''^ and the shorthand x = X™ 1 and 
c=Cf: 

x = Pc (14) 
V _1 £ = V x Pc (15) 

p T v- 1 f = (p^v^p^ie) 

(pTv-'Py^V-'x = c (17) 

(18) 

If the noise is independent between variables, 
the inverse covariance V -1 is just a diagonal ma- 
trix of weights "Wj. Note that the covariance 
here is the estimated measurement covariance, not 
the total dataset variance — the goal is to weight 
the observations by the estimated measurement 
variance so that noisy observations do not unduly 
affect the solution, while allowing PCA to describe 
the remaining signal variance. 

In the more general case of measurement co- 
variance between different observations, one can- 
not solve equation Q2] for each column of X in- 
dependently. Instead, solve X = [P]C with the 
full covariance matrix V of X, where [P] is the 
matrix formed by stacking P n bs times, and X 
and C are the vectors formed by stacking all the 
columns of the matrices X and C. This requires 
the solution of a single (n b s • ^vcc) x (^obs • ^vcc) 
matrix rather than n D b s solutions of n vcc x ri vcc 
matrices. If the individual observations are un- 
corrected, it is computationally advantageous to 
use this non-correlation to solve multiple smaller 
matrices rather than one large one. 
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As with section 15.3.11 if there are measurement 
covariances between the data, equation [19] may be 
expanded to solve for all elements of P| o1 simul- 
taneously using the full measurement covariance 
matrix of X. 

After solving for P|° , subtract its projection 
from the data: 



X = pcol ® C£° W 

-j^tow — P Tj ^row 

Fig. 2. — Illustration of solving P one element at 
a time, i.e., equation 1201 

5.3.2. Notes on solving {4>k} = P 

In the second step of each iteration (equation 
[T2"]) . we use the fixed coefficients C (dimensions 
n vcc x n Q bs) and solve for the eigenvectors P (di- 
mensions n var x n vec ). We solve the eigenvectors 
one-by-one to maximize the power in each eigen- 
vector before solving the next. Selecting out just 
the fcth eigenvector uses the fcth column of P and 
the fcth row of C: 

X = P c k ol ®Cl ow (19) 

where ® signifies an outer product. If the variables 
(rows) of X are independent, then we can solve for 
a single element of P™ 1 at a time: 

X™ w = P tt Cr (20) 

This is illustrated in figure [5J With independent 
weights W™ w on the data X' ow , we solve variable 
i of eigenvector k with: 

p ifc = t±l 3 3 (21) 



Note that this method is mathematically correct but nu- 
merically unstable. It is included here for illustration, but 
the actual c alculation should use one of the other methods 
l|Press et al.ll2002h . 



X <- X - P c k ° l <g> C^ ow (22) 

This removes any variation of the data in the di- 
rection of P™ 1 so that additional eigenvectors will 
be orthogonal to the prior onesH Then repeat the 
procedure to solve for the next eigenvector k + 1. 

6. Extensions of Weighted EMPCA 

The flexibility of the iterative EMPCA solution 
allows for a number of powerful extensions to PCA 
in addition to noise weighting. We describe a few 
of these here. 

6.1. Smoothed EMPCA 

If the length scale of the underlying signal 
eigenvectors is larger than that of the noise, it may 
be advantageous to smooth the eigenvectors to re- 
move remaining noise effects. The iterative nature 
of EMPCA allows smoothing of the eigenvectors 
at each step to remove the high frequency noise. 
This generates the optimal smooth eigenvectors by 
construction rather than smoothing noisy eigen- 
vectors afterward. This will be shown in the ex- 
amples in section [7] Alternately, one can include 
a smoothing prior or regularization term when 
solving for the principal components P. That 
approach, however, requires solving equation [19] 
(plus a regularization term) for all elements of P™ 1 
simultaneously instead of using the numerically 
much faster equation [21] for the case of diagonal 
measurement covariance. 

6.2. A priori eigenvectors 

In some applications, one has a few a priori 
template vectors to include in the fit, e.g., from 
some physical model. The goal is to find addi- 
tional template vectors which are to be combined 



2 Note that this approach is potentially susceptible to build 
up of machine rounding errors and should be explicitly 
checked when using EMPCA for solving a large number 
of eigenvectors. 
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Fig. 3. — Example of noisy data used to test Fig. 4. — Examples of classic PCA and EMPCA 

weighted EMPCA. applied to noiseless, noisy, and missing data. 



with the a priori vectors for the best fit of the 
data. Due to noise weighting and the potential 
non-orthogonality of the a priori vectors, the best 
fit is a joint fit and one cannot simply fit the a 
priori vectors and remove their components be- 
fore proceeding with finding the other unknown 
vectors. 

This case can be incorporated into EMPCA by 
including the a priori vectors in the starting vec- 
tors P and simply keeping them fixed with each 
iteration rather than updating them. In each iter- 
ation, the coefficients for the a priori vectors are 
updated, but not the vectors themselves. 

7. Examples 
7.1. Toy data 

Figure [3] shows example noisy data used to test 
weighted EMPCA. 100 data vectors were gener- 
ated using 3 orthonormal sine functions as input, 
with random amplitudes drawn from Gaussian dis- 
tributions. The lower frequency sine waves were 
given larger Gaussian sigmas such that they con- 
tribute more signal variance to the data. Gaus- 
sian random noise was added, with 10% of the 
data vectors receiving 25 times more noise from 
[0, 7r/2] and 5 times more noise from [n/2, 2tt]. 
For Weighted EMPCA, weights were assigned as 
1/er 2 , where a is the per-observation per- variable 
Gaussian sigma of the added noise (not the sigma 
of the underlying signal). A final dataset was cre- 
ated where a contiguous 10% of each observation 



was set to have weight=0 to create regions of miss- 
ing data. As a crosscheck that the weights are 
correctly applied, the data in these regions were 
set to a constant value of 1000 - if these data are 
not correctly ignored by the algorithm they will 
have a large effect on the extracted eigenvectors. 

Figure @] show the results of applying classic 
PCA and weighted EMPCA to these data. Upper 
left: Classic PCA applied to the noiseless data re- 
covers the input eigenvectors, slightly rotated to 
form the best ranked eigenvectors for describing 
the data variance. Upper right: EMPCA applied 
to the same noiseless data recovers the same in- 
put eigenvectors. Middle left: When classic PCA 
is applied to the noisy data, the highest order 
eigenvector is dominated by the noise, and the ef- 
fects of the non-uniform noise are clearly evident 
as increased noise from [0, tt/2]. Middle right: 
Weighted EMPCA is much more robust to the 
noisy data, extracting results close to the origi- 
nal eigenvectors. The highest order eigenvector 
is still affected by the noise, which is a reflec- 
tion that the noise does contribute power to the 
data variance. However, the extra-noisy region 
from [0, 7r/2] is not affected more than the region 
from [ir/2, 2ir], due to the proper deweighting of 
the noisy data. Lower left: Smoothed weighted 
EMPCA is almost completely effective at extract- 
ing the original eigenvectors with minimal impact 
from the noise. Lower right: Even when 10% of 
every observation is missing, smoothed weighted 
EMPCA is effective at extracting the underlying 
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Fig. 5. — Example of high-, median-, and low-S/N 
input QSO spectra and a broad absorption line 
(BAL) QSO (left). The hrst 3 classic PC A eigen- 
vectors are shown at top right; the hrst 3 weighted 
EMPCA eigenvectors are shown at bottom right. 



eigenvectors. All eigenvectors for all methods are 
orthogonal at the level of £>(10~ 17 ). 

7.2. QSO data 

Figure [5] shows the results of applying clas- 
sic PCA and weighted EMPCA to QSO spectra 
fr om the Sloan Digital S ky Survey Data Release 
7 ( Abazaiian et al. 20091). using t he QSO redshift 
catalog of iHewett fc Wild! (|2010h . 500 spectra of 
QSOs with redshift 2.0 < z < 2.1 were randomly 
selected and trimmed to 1340 < A < 1620 A to 
show the SilV and CIV emission features. Spec- 
tra with more than half of the pixels masked 
were discarded. Each spectrum was normalized 
to median [flux (1440 < A < 1500 A)] = 1 and the 
weighted mean of all normalized spectra was sub- 
tracted. The left panel of Fig. [5] plots examples of 
high, median, and low signal-to-noise spectra and 
a broad absorption line (BAL) QSO from this sam- 
ple. ^2% of the spectral bins have been flagged 
with a bad-data mask, e.g., due to cosmic rays, 
poor sky subtraction, or the presence of non-QSO 
narrow absorption features from the intergalactic 
medium. These are treated as missing data with 
weight=0. The goal of weighted EMPCA is to 
properly deweight the noisy spectra such that the 
resulting eigenvectors are predominantly describ- 
ing the underlying signal variations and not just 
measurement noise. Weights are where cry 



is the SDSS pipeline estimated measurement noise 
for wavelength bin i of spectrum j. Weighted EM- 
PCA can also properly ignore the masked data by 
using weight=0 without having to discard the en- 
tire spectrum, or artificially interpolate over the 
masked region. 

The right panels of Fig. [5] show the results for 
the first 3 eigenvectors of classic PCA (top right) 
and weighted EMPCA (bottom right). Eigenvec- 
tors 0, 1, and 2 are plotted in blue, green, and red 
respectively. The mean spectrum was subtracted 
prior to performing PCA such that these eigenvec- 
tors represent the principal variations of the spec- 
tra with respect to that mean spectrum. Eigen- 
vectors are orthogonal at the level of 0(1O~ 17 ). 

The EMPCA eigenvectors are much less noisy 
than the classic PCA eigenvectors. As such, they 
are more sensitive to genuine signal variations in 
the data. For example, the sharp features between 
1515 < A < 1545 A in the EMPCA eigenspec- 
tra arise from BAL QSOs, an example of which is 
shown in the bottom of the left panel of Fig. [5] 
The se feature s are used to study QSO outflows, 
e.g. Turnshek ( 19881 ). yet they are lost amidst the 
noise of the classic PCA eigenspectra. Similarly, 
the EMPCA eigenspectra are more sensitive to the 
details of the variations in shape and location of 
the emi ssion peaks used to study QSO metallic- 
ity ( e.g. Juarez et al.l ( 20091) ) and black hole mass 
(e.g. IVestergaard fc Osmerl ( 20091) ). 



8. Discussion 



8.1. Convergence 



McLachlan fc Krishnanl (|1997| ) discuss the con- 
vergence properties of the EM algorithm in gen- 
eral. Each iteration, by construction, finds a set 
of parameters that are as good or better a fit to 
the data than the previous step, thus guaranteeing 
convergence. The caveat is that the "Likelihood 
maximization step" is typically implemented as 
solving for a stationary point of the likelihood 
surface rather than strictly a maximum, e.g., 
d£/d<fi — is also true at saddle points and min- 
ima of the likelihood surface, thus it is possible 
that the EM algorithm will not converge to the 
true global maximum. Unweighted PCA has a 
likelihood surface with a single global maximum, 
but in general this is not the case for weighted 
PCA: the weights in equation [8] can result in lo- 
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cal false \ 2 minima (Srebro fc Jaakkolal |2003) 
McLachlan fc Krishnan ( 19971 ) §3. 6 also giv es ex- 



amp les of this behavior (taken fromlMurravl (|1977l ) 
and lArslan. Constable, fc Kent! (|1993l) ) for the 
closely related problem of Factor Analysis. The 
example datasets are somewhat contrived and the 
minimum or saddle point convergence only hap- 
pens with particular starting conditions. 

We have encountered false minima with weighted 
EMPCA when certain observations have ~90% of 
their variables masked while giving large weight 
to their remaining unmasked variables. In this 
case the resultant eigenvectors can have artifacts 
tuned to the highly weighted but mostly masked 
input observations. When only a few (~10%) of 
the variables are masked per observation, we have 
not had a problem with false minima. 

The algorithm outlined in section [3] solves for 
each eigenvector one at a time in order to max- 
imize the power in the initial eigenvectors. This 
can result in a situation where a given iteration 
can improve the power described by the first few 
eigenvectors while degrading the total x 2 using all 
eigenvectors. We have not found a case where this 
significantly degrades the global x 2 , however. 

Th e speed of con vergence is also not guaran- 
teed. Roweisl ( 1997 ) gives a toy example of fast 
convergence for Gaussian-distributed data (3 iter- 
ations), and slow convergence for non-Gaussian- 
distributed data (23 iterations). In practice we 
find that when EMPCA is slow to converge, it 
is exploring a shallow likelihood surface between 
two nearly degenerate eigenvectors. This situa- 
tion pertains to the uniqueness of the solution, 
described in the following section. 

Weighted EMPCA may produce unstable solu- 
tions if it is used to solve for more eigenvectors 
than are actually present in the data, or for eigen- 
vectors that are nearly singular. Since EMPCA 
uses all eigenvectors while solving for the coeffi- 
cients during each iteration, the singular eigenvec- 
tors can lead to delicately balanced meaningless 
values of the coefficients, which in turn degrades 
the solution of the updated eigenvectors in the 
next iteration. We recommend starting with solv- 
ing for a small number of eigenvectors, and then 
increasing the number of eigenvectors if the result- 
ing solution does not describe enough of the data 
variance. 




-0.15 
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Fig. 6. — Example eigenspectra of the same 
dataset with different random starting values for 
the EMPCA algorithm. The eigenspectra are off- 
set for clarity and have bin-for-bin agreement at 
the level of ~10~ 5 . 

For these reasons, one should use caution when 
analyzing data with EMPCA, just as one should 
do with any problem which is susceptible to false 
minima or other convergence issues. In practice, 
we find that the benefits of proper noise- weighting 
outweigh the potential convergence problems. 

8.2. Uniqueness 

Given that EMPCA is an iterative algorithm 
with a random starting point, the solution is not 
unique. In particular, if two eigenvalues are very 
close in magnitude, EMPCA could return an ad- 
mixture of the corresponding eigenvectors while 
still satisfying the convergence criteria. In prac- 
tice, however, EMPCA is pragmatic: if two eigen- 
vectors have the same eigenvalue, they are also 
equivalently good at describing the variance in the 
data and could be used interchangeably. 

Science applications, however, generally require 
strict algorithmic reproducibility and thus EM- 
PCA should be used with a fixed random number 
generator seed or fixed orthonormal starting vec- 
tors such as Legendre polynomials. The conver- 
gence criteria define when a given vector is "good 
enough" to move on to the next iteration, but they 
do not guarantee uniqueness of that vector. 

Figure |6] shows the first 3 eigenvectors for 5 
different EMPCA solutions of the QSO spectra 
from section 17.21 with different random starting 
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vectors. After 20 iterations the eigenvectors agree 
to < 10 -5 on both large and small scales. Al- 
though this agreement is worse than the machine 
precision of the computation, it is much smaller 
than the scale of differences between the eigenvec- 
tors and it represents a practical level of conver- 
gence for most PCA applications. 

8.3. Over-weighted Data 

Weighted EMPCA improves the PCA eigenvec- 
tor solution by preventing noisy or missing data 
from unduly contributing noise instead of signal 
variation. However, the opposite case of high 
signal-to-noise data can also be problematic if just 
a few of the observations have significantly higher 
weight than the others. These will dominate the 
EMPCA solution just as they would dominate a 
weighted mean calculation. This may not be the 
desired effect since the initial eigenvectors will de- 
scribe the differences between the highly weighted 
data and subsequent eigenvectors will describe the 
lower weighted data. This may be prevented by 
purposefully down-weighting certain observations 
or applying an upper limit to weights so that the 
weighted dataset isn't dominated by just a few ob- 
servations. 

8.4. Scaling Performance 

The primary advantage of EMPCA is the abil- 
ity to incorporate weights on noisy data to im- 
prove the quality of the resulting eigenspectra. A 
secondary benefit over classic PCA is algorithmic 
speed for the common case of needing only the first 
few eigenvectors from a dataset with n b s ~ n val . 

Classic PCA requires solving the eigenvectors 
of the data covariance matrix, an 0(n^ ar ) opera- 
tion. The weighted EMPCA algorithm described 
here involves iterating over multiple solutions of 
smaller matrices. Each iteration requires n b s so- 
lutions of 0(n^ cc ) to solve the coefficients and 
C(^obs^vec^var) operations to solve the eigenvec- 
tors. Thus weighted EMPCA can be faster than 
classic PCA when n itcl: (n ohs nl cc + n ohs n vcc n vllI ) < 
n^ ar , ignoring the constant prefactors. If one has 
a few hundred spectra (n b s ) with a few thousand 
wavelengths each (n var ) and wishes to solve for the 
first few eigenvectors (n vec ), then EMPCA can be 
much faster than classic PCA. Conversely, if one 
wishes to perform PCA on all ~1 million spectra 



from SDSS, then n Q b s 3> n vai and classic PCA 
is faster, albeit with the limitations of not be- 
ing able to properly weight noisy or missing data. 
If the problem involves off-diagonal covariances, 
then weighted EMPCA involves a smaller number 
of larger matrix solutions for an overall slowdown, 
though note that classic PCA is unable to properly 
solve the problem at all. 

As a performance example, we used EMPCA 
to study the variations in the simulated point 
spread function (PSF) of a new spectrograph de- 
sign. The PSFs were simulated on a grid of 11 
wavelengths and 6 slit locations, and were sam- 
pled over 200 x 200 /iin spots on a 1 /im grid, for 
a total of 40000 variables per spot. Classic PCA 
would require singular value decomposition of a 
40000 x 40000 matrix. While this is possible, it 
is beyond the scope of a typical laptop computer. 
On the other hand, using EMPCA with constant 
weights we were able to recover the first 30 eigen- 
vectors covering 99.7% of the PSF variance in less 
than 6 minutes on a 2.13 GHz MacBook Air lap- 
top. 

For datasets where n var is particularly large, the 
memory needed to store the n var x n va r covariance 
matrix may be a limiting factor for classic PCA. 
The iterative nature of EMPCA allows one to scale 
to extremely large datasets since one never needs 
to keep the entire dataset (nor its covariance) in 
memory at one time. The multiple independent 
equations to solve in sections 15.3.11 and 15.3.21 are 
naturally computationally parallelizable. 

9. Python Code 

Python code implementing the Weighted EM- 
PCA algorithm described here is available at 
https : //github . com/ sbailey/ empca . The cur- 
rent version implements the case of independent 
weights but not the more generalized case of 
off-diagonal covariances. It also implements the 
smoothed eigenvectors described in i )6.1[ but not 
a priori eigenvectors f £|6.2[) . nor distributed calcu- 
lations (UEHj). For comparison, the empca module 
also includes implementations of classic PCA and 
weighted lower-rank matrix approximation. Ex- 
amples for this paper were prepared with tagged 
version vO . 2 of the code. 

When using the code, note that the orientation 
of the data and weights vectors is the transpose 
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of the notation used here, i.e., data[j ,i] is vari- 
able i of observation j so that data[j] is a single 
observation. 

10. Summary 

To briefly summarize the algorithm: A data 
matrix X can be approximated by a set of eigen- 
vectors P with coefficients C: 



PC + measurement noise 



(23) 



A covariance matrix V describes the estimated 
measurement noise. 

The Weighted EMPCA algorithm seeks to find 
the optimal P and C given X and V. It starts 
with a random set of orthonormal vectors P, and 
then iteratively alternates solutions for C given 
{P,X,V} and P given {C,X,V}. The problem 
is additionally constrained by the requirement to 
maximize the power in the fewest number of eigen- 
vectors (columns of P). To accomplish this, the 
algorithm solves for each eigenvector individually, 
before removing its projection from the data and 
solving for the next eigenvector. If the measure- 
ment errors are independent, the covariance can be 
described by a weights matrix W with the same 
dimensions as X, and the problem can be factor- 
ized into independent solutions of small matrices. 

This algorithm produces a set of orthogonal 
principal component eigenvectors P, which are op- 
timized to describe the most signal variance with 
the fewest vectors while properly accounting for 
estimated measurement noise. 

11. Conclusions 

We have described a method for performing 
PCA on noisy data that properly incorporates 
measurement noise estimates when solving for 
the eigenvectors and coefficients. Missing data 
is simply the limiting case of weight=0. The 
method uses an iterative solution based upon Ex- 
pectation Maximization. The resulting eigen- 
vectors are less sensitive to measurement noise 
and more sensitive to true underlying signal vari- 
ations. The algorithm has been demonstrated 
on toy data and QSO spectra from SDSS. Code 
which implements this algorithm is available at 
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